-
A new scientific discipline has emerged within the last decade at the intersection of informatics, computer science and biology: Imageomics. Like most other -omics fields, Imageomics uses emerging technologies to analyze biological data, but it derives those data from images. One of the most widely applied data analysis methods for image datasets is Machine Learning (ML). In 2019, we started working on a United States National Science Foundation (NSF) funded project, known as Biology Guided Neural Networks (BGNN), with the purpose of extracting information about biology by using neural networks and biological guidance such as species descriptions, identifications, phylogenetic trees and morphological annotations (Bart et al. 2021). Even though the variety and abundance of biological data are satisfactory for some ML analyses and the data are openly accessible, researchers still spend up to 80% of their time preparing data into a usable, AI-ready format, leaving only 20% for exploration and modeling (Long and Romanoff 2023). For this reason, we have built a dataset composed of digitized fish specimens, taken either directly from collections or from specialized repositories. The range of digital representations we cover is broad and growing, from photographs and radiographs to CT scans and even illustrations. We have added new groups of vocabularies to the dataset management system, including image quality metadata, extended image metadata and batch metadata. With the image quality metadata and extended image metadata, we aimed to extract information from the digital objects that can help ML scientists in their research with filtering, image processing and object recognition routines.
Image quality metadata provides information about objects contained in the image, features and condition of the specimen, and some basic visual properties of the image, while extended image metadata provides information about technical properties of the digital file and the digital multimedia object (Bakış et al. 2021, Karnani et al. 2022, Leipzig et al. 2021, Pepper et al. 2021, Wang et al. 2021) (see details on the Fish-AIR vocabulary web page). Batch metadata is used for separating different datasets and facilitates downloading and uploading data in batches with additional batch information and supplementary files. Additional flexibility, built into the database infrastructure using an RDF framework, will enable the system to host different taxonomic groups, which might require new metadata features (Jebbia et al. 2023). By combining these features with the FAIR (Findable, Accessible, Interoperable, Reusable) principles and reproducibility, we provide Artificial Intelligence Readiness (AIR; Long and Romanoff 2023) to the dataset. Fish-AIR provides an easy-to-access, filtered, annotated and cleaned biological dataset for researchers from different backgrounds and facilitates the integration of biological knowledge based on digitized preserved specimens into ML pipelines. Because of the flexible database infrastructure and the addition of new datasets, researchers will also be able to access additional types of data (such as landmarks, specimen outlines, annotated parts, and quality scores) in the near future. Already, the dataset is the largest and most detailed AI-ready fish image dataset with an integrated Image Quality Management System (Jebbia et al. 2023, Wang et al. 2021).
-
Over the last three years, during the “Biology Guided Neural Network” (BGNN) project, we have been successfully developing Artificial Intelligence (AI) models that automatically classify fish species using neural networks. We continue our efforts in another, broader project, “Imageomics: A New Frontier of Biological Information Powered by Knowledge-Guided Machine Learning”. One of the main topics in the Imageomics Project is “Morphological Barcoding”. Within the Morphological Barcoding study, we are trying to build a gold standard method to identify species in different taxonomic groups based on their external morphology. The list of characters will contain, but not be limited to, landmarks and quantitative traits such as measurements of distances, areas, angles, proportions, colors, histograms, patterns, shapes, and outlines. The taxonomic groups will be limited by the data available, and we use fish as the group of interest in this preliminary study. In the current study, we have focused on extracting morphological characters that rely on anatomical features of fish, such as the location of the eye, body length, and area of the head. We developed a schematic workflow to describe how we processed the data and extracted the information (Fig. 1). We performed our analysis on the segmented images produced by the Karpatne Team within the BGNN project (Bart et al. 2021). Segmentation analysis was performed using Artificial Neural Networks - Semantic Segmentation (Long et al. 2015); the segments to be detected for fish were the eye, head, trunk, caudal fin, pectoral fin, dorsal fin, anal fin, and pelvic fin. Segmented images, metadata and species lists were given as input to the workflow. During the cleaning and filtering subroutines, a subset of data was created by filtering down to the desired segmented images with corresponding metadata.
In the validation step, segmented images were checked by comparing the number of specimens in the original image to the number of separate bounding-boxed specimen images, and by noting violations in the segmentations, counts of segments, relative positions of the segments to one another, traces of batch effect, and segment size and shape. Based on these validation criteria, each segmented image was assigned a score from 1 to 5, similar to the rating used in the Adobe XMP Basic namespace. The landmarks and the traits to be used in the study were taken from the current literature, bearing in mind that some of the features may not be extracted successfully by computational means. Using the landmark list, landmarks were extracted by adapting the descriptions from the literature onto the segments, such as picking the leftmost point on the head as the tip of the snout and the top left point on the pelvic fin as the base of the pelvic fin. These 2D vectors (coordinates) were then fine-tuned by adjusting their positions to lie on the outline of the fish, since most of the landmarks are located on the outline. Procrustes analysis was performed to scale all of the measurements together, and point clouds were generated. These vectors were stored as landmark data. Segment centroids were also treated as landmarks. Extracted landmarks were validated by comparing their positions relative to each other and, where available, with their manually captured positions. A score was assigned based on these comparisons, similar to the segmentation validation score. Based on the trait list definitions, traits were extracted by measuring the distances between two landmarks, angles between three landmarks, areas between three or more landmarks, areas of the segments, and ratios between two distances or areas and between a distance and a square-rooted area; the results were then stored as trait data.
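The landmark and trait definitions above can be illustrated with a minimal Python sketch. It is not the project's code: it assumes each segment is available as a list of (x, y) pixel coordinates, the helper names (`leftmost_point`, `centroid`, `distance`, `angle_deg`, `polygon_area`) are hypothetical, and the toy rectangles stand in for real segment masks.

```python
import math

def leftmost_point(pixels):
    """Leftmost pixel of a segment; ties are broken by the smaller y value."""
    return min(pixels, key=lambda p: (p[0], p[1]))

def centroid(pixels):
    """Mean (x, y) position of a segment's pixels."""
    n = len(pixels)
    return (sum(x for x, _ in pixels) / n, sum(y for _, y in pixels) / n)

def distance(p, q):
    """Euclidean distance between two landmarks."""
    return math.hypot(q[0] - p[0], q[1] - p[1])

def angle_deg(a, vertex, c):
    """Unsigned angle at `vertex` between the rays to `a` and `c`, in degrees."""
    v1 = (a[0] - vertex[0], a[1] - vertex[1])
    v2 = (c[0] - vertex[0], c[1] - vertex[1])
    dot = v1[0] * v2[0] + v1[1] * v2[1]
    cross = v1[0] * v2[1] - v1[1] * v2[0]
    return math.degrees(math.atan2(abs(cross), dot))

def polygon_area(points):
    """Area enclosed by three or more landmarks taken in order (shoelace formula)."""
    n = len(points)
    twice_area = sum(
        points[i][0] * points[(i + 1) % n][1] - points[(i + 1) % n][0] * points[i][1]
        for i in range(n)
    )
    return abs(twice_area) / 2.0

# Toy segments (assumed data, not fish-shaped): a 7 x 9 block and a 2 x 2 block.
head_pixels = [(x, y) for x in range(7) for y in range(9)]
pelvic_fin_pixels = [(6, 0), (7, 0), (6, 1), (7, 1)]

landmarks = {
    "snout_tip": leftmost_point(head_pixels),          # (0, 0)
    "head_centroid": centroid(head_pixels),            # (3.0, 4.0)
    "pelvic_fin_base": leftmost_point(pelvic_fin_pixels),  # (6, 0)
}

snout = landmarks["snout_tip"]
head_c = landmarks["head_centroid"]
pelvic = landmarks["pelvic_fin_base"]

area = polygon_area([snout, head_c, pelvic])
traits = {
    "snout_to_head_centroid": distance(snout, head_c),
    "angle_at_snout": angle_deg(head_c, snout, pelvic),
    "triangle_area": area,
    # Ratio between a distance and a square-rooted area, as in the trait list.
    "distance_over_sqrt_area": distance(snout, head_c) / math.sqrt(area),
}
```

Dividing a distance by the square root of an area gives a dimensionless ratio, so such a trait does not change when the image is rescaled.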
Finally, these values were compared within their own species clusters to detect errors and to check whether they were still within bounds. Trait scores were calculated from these error calculations, similar to the segmentation scores, with the aim of selecting good-quality scores for further analysis such as Principal Component Analysis. Our work on extracting features from segmented digital specimen images has shown that the accuracy of traits such as measurements, areas, and angles depends on the accuracy of the landmarks. The accuracy of the landmarks is, in turn, highly dependent on the segmentation of the parts of the specimen. The landmarks located on the outline of the body (the combination of the head and trunk segments of the fish) are found to be more accurate than the landmarks that represent inner features, such as the mouth and pectoral fin, in some taxonomic groups. However, eye location is almost always accurate, since it is based on the centroid of the eye segment. In the remaining part of this study, we will improve the score calculation for segments, images, landmarks and traits, and calculate the accuracy of the scores by comparing the statistical results obtained from analysis of the landmark and trait data.
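The abstract does not say how the bounds for the within-species comparison were computed. As one possible illustration, the Python sketch below flags trait values that lie more than `k` median absolute deviations from their species median. The rule, the function name `flag_out_of_bounds`, the threshold `k`, and the sample records are all assumptions made for the example.

```python
from collections import defaultdict
from statistics import median

def flag_out_of_bounds(records, k=3.0):
    """Return the specimen ids whose trait value falls outside its species' bounds.

    `records` is a list of (specimen_id, species, value) tuples. The bounds are
    median +/- k * MAD per species, an assumed rule chosen for illustration.
    """
    by_species = defaultdict(list)
    for _, species, value in records:
        by_species[species].append(value)

    bounds = {}
    for species, values in by_species.items():
        med = median(values)
        mad = median([abs(v - med) for v in values])
        bounds[species] = (med - k * mad, med + k * mad)

    return {
        specimen_id
        for specimen_id, species, value in records
        if not bounds[species][0] <= value <= bounds[species][1]
    }

# Hypothetical trait values (for example, a body length ratio) for two species.
records = [
    ("a1", "species_a", 10.0),
    ("a2", "species_a", 10.5),
    ("a3", "species_a", 9.8),
    ("a4", "species_a", 10.2),
    ("a5", "species_a", 25.0),  # likely a landmark or segmentation error
    ("b1", "species_b", 4.0),
    ("b2", "species_b", 4.1),
    ("b3", "species_b", 3.9),
]

flagged = flag_out_of_bounds(records)
```

A median-based rule is used here because a single badly segmented specimen would otherwise inflate a mean and standard deviation computed over a small species cluster. If all values in a cluster are identical, the MAD is zero and any deviation is flagged, so a real pipeline would likely need a minimum tolerance.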
